1 Spotify Clustering

In this report, I will perform PCA dimensionality reduction on a Spotify dataset and then cluster the songs based on their audio characteristics.

2 Structure of this report

  • Read data and basic pre-processing
  • PCA reduction:
      • Insights on summary and plot of PCA
      • Uses of PCA
      • Return as dataframe
  • Clustering using K-Means:
      • Finding optimum K using the elbow method
      • Clustering and evaluation of clusters
      • Tuning the clusters
  • Purpose of clustering:
    1. Cluster Profiling
    2. Song Recommendation

3 Read data and basic pre-processing

I will take only the first 10,000 rows, because processing the full dataset requires more computation than my laptop can comfortably handle.

library(tidyverse)
spotify <- read.csv("SpotifyFeatures.csv", row.names=NULL)
spotify10000 <- head(spotify, 10000)
spotify_clean <- spotify10000 %>% 
  mutate_if(is.character, as.factor) %>% 
  mutate(track_name = as.character(track_name)) %>% 
  select(-track_id)
spotify_number <- spotify_clean %>% 
  select_if(is.numeric)

4 PCA reduction

library(FactoMineR)
# Run PCA twice: FactoMineR's PCA() for its summary and plotting helpers,
# and base R's prcomp() for direct access to the loadings (rotation matrix)
spotify_pca <- PCA(spotify_number, scale.unit = TRUE, graph = FALSE)
spotify_pca2 <- prcomp(spotify_number, scale. = TRUE)

4.1 Insights on summary and plot of PCA

summary(spotify_pca)
## 
## Call:
## PCA(X = spotify_number, scale.unit = TRUE, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               2.711   1.526   1.191   1.120   0.985   0.930   0.850
## % of var.             24.650  13.871  10.825  10.177   8.956   8.453   7.727
## Cumulative % of var.  24.650  38.521  49.347  59.524  68.480  76.934  84.660
##                        Dim.8   Dim.9  Dim.10  Dim.11
## Variance               0.686   0.471   0.389   0.141
## % of var.              6.237   4.282   3.540   1.281
## Cumulative % of var.  90.897  95.180  98.719 100.000
## 
## Individuals (the 10 first)
##                      Dist    Dim.1    ctr   cos2    Dim.2    ctr   cos2  
## 1                |  4.929 |  0.594  0.001  0.014 |  0.167  0.000  0.001 |
## 2                |  4.028 |  0.160  0.000  0.002 |  1.078  0.008  0.072 |
## 3                |  5.160 | -4.688  0.081  0.825 |  0.890  0.005  0.030 |
## 4                |  5.242 | -3.177  0.037  0.367 | -2.013  0.027  0.148 |
## 5                |  6.406 | -5.242  0.101  0.670 | -0.783  0.004  0.015 |
## 6                |  5.267 | -4.714  0.082  0.801 |  0.742  0.004  0.020 |
## 7                | 10.408 | -3.134  0.036  0.091 |  2.932  0.056  0.079 |
## 8                |  4.124 | -3.388  0.042  0.675 | -0.861  0.005  0.044 |
## 9                |  3.905 | -0.803  0.002  0.042 |  1.654  0.018  0.179 |
## 10               |  3.095 | -0.407  0.001  0.017 |  0.923  0.006  0.089 |
##                   Dim.3    ctr   cos2  
## 1                 1.208  0.012  0.060 |
## 2                 0.751  0.005  0.035 |
## 3                -0.162  0.000  0.001 |
## 4                 0.151  0.000  0.001 |
## 5                 0.195  0.000  0.001 |
## 6                 0.530  0.002  0.010 |
## 7                 6.161  0.319  0.350 |
## 8                -0.135  0.000  0.001 |
## 9                 0.211  0.000  0.003 |
## 10                0.837  0.006  0.073 |
## 
## Variables (the 10 first)
##                     Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3    ctr
## popularity       |  0.464  7.956  0.216 | -0.051  0.172  0.003 | -0.284  6.756
## acousticness     | -0.839 25.990  0.705 |  0.061  0.244  0.004 |  0.054  0.243
## danceability     | -0.027  0.026  0.001 |  0.829 45.092  0.688 | -0.038  0.121
## duration_ms      | -0.042  0.064  0.002 | -0.382  9.572  0.146 |  0.412 14.256
## energy           |  0.923 31.432  0.852 | -0.013  0.011  0.000 |  0.090  0.674
## instrumentalness | -0.103  0.391  0.011 | -0.247  3.987  0.061 | -0.129  1.397
## liveness         |  0.132  0.638  0.017 | -0.066  0.288  0.004 |  0.644 34.806
## loudness         |  0.865 27.578  0.748 | -0.050  0.163  0.002 | -0.056  0.265
## speechiness      | -0.010  0.004  0.000 |  0.196  2.512  0.038 |  0.682 39.055
## tempo            |  0.289  3.084  0.084 | -0.264  4.575  0.070 |  0.127  1.359
##                    cos2  
## popularity        0.080 |
## acousticness      0.003 |
## danceability      0.001 |
## duration_ms       0.170 |
## energy            0.008 |
## instrumentalness  0.017 |
## liveness          0.414 |
## loudness          0.003 |
## speechiness       0.465 |
## tempo             0.016 |
spotify_pca2$rotation
##                           PC1         PC2         PC3          PC4          PC5
## popularity        0.282070151 -0.04142711  0.25991994  0.480946305 -0.168524430
## acousticness     -0.509804020  0.04939635 -0.04926554 -0.118826841 -0.052295865
## danceability     -0.016198945  0.67150434  0.03478232  0.255147015  0.045055855
## duration_ms      -0.025270302 -0.30938238 -0.37757534  0.418864601  0.013814015
## energy            0.560640684 -0.01035973 -0.08207478 -0.038132956  0.097616839
## instrumentalness -0.062556989 -0.19966894  0.11817840  0.241592532  0.890209988
## liveness          0.079904988 -0.05368054 -0.58996954 -0.153789724  0.188196360
## loudness          0.525150228 -0.04034814  0.05151931  0.004540255 -0.088889536
## speechiness      -0.006331369  0.15849523 -0.62494273  0.305035466 -0.153970017
## tempo             0.175624678 -0.21389069 -0.11658289 -0.529704217  0.005597191
## valence           0.168381294  0.57780076 -0.10332422 -0.238531173  0.312383767
##                          PC6         PC7        PC8         PC9         PC10
## popularity       -0.02950508  0.45021448  0.5226917  0.33622394  0.032765622
## acousticness     -0.00396675  0.07107606  0.1129885  0.25154508  0.743901950
## danceability      0.08446596 -0.03578955  0.2431472 -0.61647953  0.109700117
## duration_ms       0.29685305 -0.62279352  0.3261231  0.03104665  0.046051677
## energy           -0.02343963 -0.09631607 -0.2064706  0.07960324  0.092039521
## instrumentalness  0.10781974  0.21153464 -0.1092217 -0.07509894  0.121758440
## liveness         -0.64727160  0.15294677  0.3553483 -0.11327919 -0.008070191
## loudness         -0.08073225 -0.12337475 -0.1746696 -0.13406458  0.634227590
## speechiness       0.30422239  0.43483367 -0.4140578  0.08768901  0.006880046
## tempo             0.58870694  0.28298568  0.3774842 -0.24149273  0.041327337
## valence           0.16455283 -0.20323262  0.1525471  0.57782134 -0.063403402
##                          PC11
## popularity       -0.023698688
## acousticness     -0.289618907
## danceability     -0.144438889
## duration_ms      -0.002832397
## energy           -0.774978055
## instrumentalness  0.051436298
## liveness          0.045579018
## loudness          0.489434990
## speechiness       0.081977133
## tempo            -0.004415781
## valence           0.207576879
spotify_pca2$sdev
##  [1] 1.6466586 1.2352536 1.0912351 1.0580680 0.9925793 0.9642913 0.9219191
##  [8] 0.8282730 0.6863430 0.6240054 0.3753269

Based on the summary above, to retain at least 80% of the variance (i.e. at most 20% information loss), we need at least 7 PCs (PC1 + PC2 + ... + PC7): the cumulative percentage of variance reaches 84.66% at Dim.7, whereas 6 PCs only cover 76.93%.
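As a quick cross-check, we can recompute the cumulative proportion of variance directly from the eigenvalues reported in the summary above:

```r
# Eigenvalues copied from the PCA summary above
# (they sum to 11, the number of scaled variables)
eig <- c(2.711, 1.526, 1.191, 1.120, 0.985, 0.930, 0.850,
         0.686, 0.471, 0.389, 0.141)
cum_var <- cumsum(eig) / sum(eig)
which(cum_var >= 0.80)[1]  # first PC count reaching 80%: 7
```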

We can also see that the columns carry different weights (loadings) on each PC. Energy has the largest loading on PC1 (0.5606), danceability has the largest loading on PC2 (0.6715), and so on.
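The two PCA objects are consistent with each other: the variable coordinates reported by FactoMineR are the prcomp loadings scaled by the corresponding standard deviation (the square root of the eigenvalue). For example, for energy on Dim.1, using the numbers printed above:

```r
# Values copied from the output above
loading <- 0.560640684  # energy's loading on PC1 (spotify_pca2$rotation)
sdev1   <- 1.6466586    # first entry of spotify_pca2$sdev
round(loading * sdev1, 3)  # 0.923, energy's Dim.1 coordinate in the summary
```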

We can also read off the eigenvalue of each PC. PC1 has the highest eigenvalue (2.711), PC2 the second highest (1.526), followed by PC3 (1.191), and so on. The eigenvalue directly corresponds to the amount of variance each PC carries: since PC1 has the highest eigenvalue, it also explains the largest share of the variance (24.650%), followed by PC2 (13.871%) and PC3 (10.825%).

plot.PCA(x = spotify_pca, choix = c("ind"), select = "contrib7", habillage = "ind")

Based on the plot above, we can identify several outliers, such as the observations with indices 343, 97, 451, 471, 284, 133 and 6305. We can further analyse these outliers to see whether each affects PC1 or PC2 more. For example, observations 6305 and 343 affect PC1 more than PC2 (as seen from their positions and the scales of both axes), whereas observation 133 affects PC2 more than PC1.

Next, we can analyse the effect of each column on the PCs.

plot.PCA(spotify_pca, cex=0.6, choix = c("var"))

a <- dimdesc(spotify_pca)

as.data.frame(a[[1]]$quanti) #correlation to PC1
as.data.frame(a[[2]]$quanti) #correlation to PC2

Based on the plot and data frames above, acousticness, danceability and valence affect PC2 more than PC1, whereas energy, loudness, popularity and tempo affect PC1 more than PC2. From here, we can also read the collinearity between columns: energy and loudness have very high positive collinearity, whereas popularity and acousticness have very high negative collinearity. This can be seen from the relative positions and directions of the variable arrows on the plot.

4.2 Uses of PCA

Besides reducing the dimensionality of the data without losing much information, PCA can be used to satisfy the no-multicollinearity assumption required of predictors when building a linear regression model. This is because the principal components are, by construction, uncorrelated with each other.

Below is an example:

library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
ggcorr(spotify_number, label = T)

We can see from the plot above that several columns are highly collinear with each other. For example, loudness and energy have very high positive collinearity (0.8), and energy and acousticness have very high negative collinearity (-0.7). If we were to build a linear regression model from this dataset, we might not be able to fulfil the assumption of no multicollinearity between predictors.

Hence, one way to solve this issue would be to do PCA on the data and use that result instead of the original numerical values.

ggcorr(data.frame(spotify_pca2$x), label = T)

As we can see, the correlation between the PCs is zero, so the no-multicollinearity assumption would be fulfilled if they were used in a linear regression model. However, there is a caveat: once the data is transformed into PCs, the individual values are no longer directly interpretable, so use this method sparingly and deliberately.
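To illustrate the idea, here is a minimal sketch of principal-component regression using the built-in mtcars data as a small stand-in (the Spotify data here has no designated response variable, so this is purely illustrative):

```r
# PCA on the predictors only, then regress the response on the first 3 PCs
pcs   <- prcomp(mtcars[, -1], scale. = TRUE)
pc_df <- data.frame(mpg = mtcars$mpg, pcs$x[, 1:3])
fit   <- lm(mpg ~ ., data = pc_df)

# The PC predictors are exactly uncorrelated, so the
# no-multicollinearity assumption is satisfied by construction
max(abs(cor(pcs$x)[lower.tri(cor(pcs$x))]))  # effectively 0
```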

4.3 Return as dataframe

Since we have decided to accept a 20% loss of variance, I will keep only the first 7 PCs and return them as a data frame together with the non-numerical columns.

pca_keep <- spotify_pca2$x[,c(1:7)] %>% 
  as.data.frame()

spotify_final <- spotify_clean %>% 
  select_if(negate(is.numeric)) %>% 
  bind_cols(pca_keep)

head(spotify_final)

5 Clustering using K-Means

We will now cluster the songs based on their characteristics into several clusters.

summary(spotify_clean)
##          genre                       artist_name    track_name       
##  A Capella  : 119   Chorus                 : 102   Length:10000      
##  Alternative:5054   Henri Salvador         :  88   Class :character  
##  Country    :4162   George Strait          :  68   Mode  :character  
##  Dance      : 113   Five Finger Death Punch:  63                     
##  Movie      : 408   Linkin Park            :  61                     
##  R&B        : 144   Kenny Chesney          :  60                     
##                     (Other)                :9558                     
##    popularity      acousticness        danceability     duration_ms     
##  Min.   :  0.00   Min.   :0.0000014   Min.   :0.0617   Min.   :  18800  
##  1st Qu.: 39.00   1st Qu.:0.0126000   1st Qu.:0.4710   1st Qu.: 189877  
##  Median : 47.00   Median :0.1120000   Median :0.5620   Median : 216760  
##  Mean   : 45.97   Mean   :0.2453824   Mean   :0.5598   Mean   : 226406  
##  3rd Qu.: 54.00   3rd Qu.:0.4222500   3rd Qu.:0.6540   3rd Qu.: 249410  
##  Max.   :100.00   Max.   :0.9950000   Max.   :0.9710   Max.   :3631469  
##                                                                         
##      energy        instrumentalness         key          liveness     
##  Min.   :0.00154   Min.   :0.0000000   G      :1230   Min.   :0.0214  
##  1st Qu.:0.49075   1st Qu.:0.0000000   D      :1179   1st Qu.:0.0985  
##  Median :0.68300   Median :0.0000073   C      :1122   Median :0.1310  
##  Mean   :0.65174   Mean   :0.0335583   A      :1009   Mean   :0.1945  
##  3rd Qu.:0.83600   3rd Qu.:0.0007252   C#     : 891   3rd Qu.:0.2460  
##  Max.   :0.99800   Max.   :0.9840000   E      : 874   Max.   :0.9960  
##                                        (Other):3695                   
##     loudness          mode       speechiness          tempo       
##  Min.   :-29.368   Major:7381   Min.   :0.02230   Min.   : 32.24  
##  1st Qu.: -8.896   Minor:2619   1st Qu.:0.03200   1st Qu.: 96.84  
##  Median : -6.496                Median :0.04190   Median :120.00  
##  Mean   : -7.277                Mean   :0.07557   Mean   :121.72  
##  3rd Qu.: -4.873                3rd Qu.:0.07220   3rd Qu.:142.73  
##  Max.   : -0.259                Max.   :0.96100   Max.   :216.03  
##                                                                   
##  time_signature    valence      
##  1/4:  47       Min.   :0.0000  
##  3/4: 700       1st Qu.:0.3130  
##  4/4:9148       Median :0.4770  
##  5/4: 105       Mean   :0.4893  
##                 3rd Qu.:0.6650  
##                 Max.   :0.9830  
## 

Since the ranges between each column to another varies a lot (max of duration_ms is 3631469 while max of valence is 0.9830), we need to scale the data.

spotify_clusternew <- spotify_clean %>% 
  select_if(is.numeric) %>% 
  scale() %>% 
  as.data.frame()
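As a sanity check on what scale() does here, each column is standardized to mean 0 and standard deviation 1, shown below on a tiny made-up data frame:

```r
# A tiny demonstration of scale(): column means become 0, sds become 1
x <- scale(data.frame(a = c(1, 2, 3), b = c(100, 250, 400)))
round(colMeans(x), 10)   # a and b both 0
apply(x, 2, sd)          # a and b both 1
```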

5.1 Finding Optimum K using elbow method

library(factoextra)
fviz_nbclust(spotify_clusternew, kmeans, method = "wss")

Based on the graph above, the most suitable k is 5, because the drop in the total within sum of squares from 5 clusters to 6 is very small. The lower the within sum of squares, the tighter each cluster is around its centroid.
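For reference, fviz_nbclust() with method = "wss" essentially computes the following loop; it is shown here on the built-in iris data as a small stand-in so the snippet runs on its own:

```r
# Total within-cluster sum of squares for k = 1..10
set.seed(100)
wss <- sapply(1:10, function(k) {
  kmeans(scale(iris[, 1:4]), centers = k, nstart = 10)$tot.withinss
})
plot(1:10, wss, type = "b", xlab = "Number of clusters k",
     ylab = "Total within sum of squares")
```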

5.2 Clustering and evaluation of cluster

set.seed(100)
spotify_kmeans <- kmeans(spotify_clusternew, centers = 5)
fviz_cluster(spotify_kmeans, spotify_clusternew, ggtheme = theme_minimal())

spotify_kmeans$betweenss/spotify_kmeans$totss
## [1] 0.3207211

According to the computation above, this clustering is still not very good: the ratio of its between sum of squares (the total distance from each centroid to the centre of the whole data) to its total sum of squares (the total distance from each observation to the centre of the whole data) is quite low, and the closer this ratio is to 1, the better.
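This ratio is a sensible goodness measure between 0 and 1 because kmeans() returns quantities that satisfy a fixed decomposition, totss = betweenss + tot.withinss. A quick stand-alone check on the built-in iris data:

```r
set.seed(100)
km <- kmeans(scale(iris[, 1:4]), centers = 3)
# Total SS decomposes into between-cluster and within-cluster parts
isTRUE(all.equal(km$totss, km$betweenss + km$tot.withinss))  # TRUE
```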

5.3 Tuning cluster

In order to get a more favourable outcome, we will try to change the number of clusters.

We will try 8, since in the elbow-method plot the drop in total within sum of squares from 8 clusters to 9 is also very small.

set.seed(100)
spotify_kmeans2 <- kmeans(spotify_clusternew, centers = 8)
fviz_cluster(spotify_kmeans2, spotify_clusternew, ggtheme = theme_minimal())

As we can see from the above diagram, the clusters have a lot of overlap. This may not be so good, so we can try a smaller number of clusters.

set.seed(100)
spotify_kmeans3 <- kmeans(spotify_clusternew, centers = 3)
fviz_cluster(spotify_kmeans3, spotify_clusternew, ggtheme = theme_minimal())

We now evaluate the two new clustering models against the original one.

spotify_kmeans$tot.withinss
## [1] 74713.21
spotify_kmeans2$tot.withinss
## [1] 57609.36
spotify_kmeans3$tot.withinss
## [1] 85106.21

A good model has a small total within sum of squares. Since the total within sum of squares measures the total distance from each observation in a cluster to its centroid, the smaller the value, the tighter the clusters, making the model better at separating different songs.

From this, we can see that the model with the smallest total within sum of squares is the one with 8 clusters.

spotify_kmeans$betweenss
## [1] 35275.79
spotify_kmeans2$betweenss
## [1] 52379.64
spotify_kmeans3$betweenss
## [1] 24882.79

A good model has a large between sum of squares. Since the between sum of squares measures the total distance from each cluster centroid to the centre of the data, the larger the value, the more distinct the clusters are from one another.

From this, we can see that the model with the largest between sum of squares is the one with 8 clusters.

spotify_kmeans$betweenss/spotify_kmeans$totss
## [1] 0.3207211
spotify_kmeans2$betweenss/spotify_kmeans2$totss
## [1] 0.4762262
spotify_kmeans3$betweenss/spotify_kmeans3$totss
## [1] 0.2262298

From this, we can see that the model with 8 clusters has the ratio of between sum of squares to total sum of squares closest to 1.

According to these three comparisons, the best clustering model is the one with 8 clusters, so we will move forward with that model.

6 Purpose of clustering

6.1 Cluster Profiling

spotify_clusternew %>% 
  mutate(cluster = as.factor(spotify_kmeans2$cluster)) %>%
  group_by(cluster) %>% 
  summarise_all(mean) %>% 
  pivot_longer(cols = -c(1), names_to = "type", values_to = "value") %>% # every column besides cluster is pivoted into long format
  ggplot(aes(x = cluster, y = value, fill = cluster)) + 
  geom_col() +
  facet_wrap(~type) +
  theme_minimal()

I have broken the clusters down by their characteristics so that we can better visualise how each cluster differs.

From this, we can see that cluster 6 songs have a very long average duration compared to songs in other clusters. Cluster 6 also has the highest average speechiness score and the lowest average popularity score. On the other hand, cluster 1 songs have the highest average liveness score. Another interesting observation is that cluster 4 and cluster 6 songs have very similar acousticness, energy, loudness and popularity scores, though they differ greatly in duration and danceability.

6.2 Song recommendation

spotify_clusternew %>% 
  mutate(cluster = as.factor(spotify_kmeans2$cluster)) %>% 
  mutate(track = as.factor(spotify_clean$track_name)) %>% 
  group_by(cluster) %>% 
  arrange(cluster) %>% 
  filter(cluster == "1") %>% 
  select_if(negate(is.numeric))

For example, if a Spotify user likes the song "But pour Rudy" a lot, Spotify would be able to recommend songs in the same cluster as "But pour Rudy", such as "Flawless Remix" or "Remember Me (Dúo)".

Similarly, if a user likes a song in another cluster, Spotify could use this approach to suggest songs that suit the user's taste, i.e. songs in the same cluster as the user's favourite song.
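This recommendation logic can be sketched as a small function. Note that recommend_songs is a hypothetical helper name for illustration, not part of any package; it is shown on toy data so it runs on its own.

```r
# Recommend up to n other songs from the same cluster as a liked song
recommend_songs <- function(liked, track_names, clusters, n = 5) {
  cl   <- clusters[match(liked, track_names)]
  same <- track_names[clusters == cl & track_names != liked]
  head(unique(same), n)
}

# Toy usage; with the data above this would be called as
# recommend_songs("But pour Rudy", spotify_clean$track_name,
#                 spotify_kmeans2$cluster)
recommend_songs("A", c("A", "B", "C", "D"), c(1, 1, 2, 1), n = 2)  # "B" "D"
```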